updated: 2021-04-28
| level | n_samples | n_features | n_features_preproc |
|---|---|---|---|
| phylum | 490 | 19 | 9 |
| class | 490 | 36 | 19 |
| order | 490 | 65 | 28 |
| family | 490 | 124 | 54 |
| genus | 490 | 316 | 115 |
| otu | 490 | 20079 | 705 |
| asv | 490 | 104106 | 478 |
Plots of HP performance for each model/taxonomic level
The default hyperparameter (mtry) selection for RF is:
Based on previous results, I added mtry=1 for Phylum through Genus levels and mtry=100 for OTU and ASV levels. This set of hyperparameters seems to cover the optimal range for each taxonomic level.
Two hyperparameters: alpha and lambda
By default, alpha is set to zero (for L2 regularization) and the lambda values are 1e-04, 1e-03, 1e-02, 1e-01, 1e+00, and 1e+01. I added 1e-05 as a lambda value for phylum and class, and higher values of lambda for the remaining levels.
The default hyperparameters were retained for the decision tree (maxdepth = 1, 2, 4, 8, 16, 30). The model would not accept a maxdepth larger than 30.
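As a rough illustration of the grid described above, the following Python sketch builds a per-level lambda grid. It is not mikropml code; `DEFAULT_LAMBDAS`, `LEVELS`, and `lambda_grid` are hypothetical names. The larger lambda values added for order through ASV are not listed in these notes, so the sketch leaves those levels at the defaults rather than guessing.

```python
# Default lambda values as listed in the notes; alpha is fixed at 0 (L2 / ridge).
DEFAULT_LAMBDAS = [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
LEVELS = ["phylum", "class", "order", "family", "genus", "otu", "asv"]


def lambda_grid(level):
    """Return the lambda values to try for one taxonomic level (hypothetical helper)."""
    if level in ("phylum", "class"):
        # The notes add a smaller lambda for the two coarsest levels.
        return [1e-5] + DEFAULT_LAMBDAS
    # The notes add larger lambdas for the other levels but do not say which,
    # so only the defaults are returned here.
    return list(DEFAULT_LAMBDAS)


grids = {level: {"alpha": [0], "lambda": lambda_grid(level)} for level in LEVELS}

for level, grid in grids.items():
    print(level, grid)
```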
(needs adjustment)
Because some believe that ASVs generated by Mothur are not as good as ASVs generated by DADA2, I ran a comparison. I used DADA2 (v1.18.0) to generate ASVs and ran them through the same mikropml model pipeline. I ran dada(…, pool=T) to capture more rare ASVs and subsampled to XX reads after processing.
Below is a summary table of the number of features at each level. Interestingly, only 5,508 ASVs are identified with DADA2, compared to 104,106 with Mothur. After preprocessing, 630 DADA2 ASVs remain, compared to 478 for Mothur in the table below.
| level | n_samples | n_features | n_features_preproc |
|---|---|---|---|
| phylum | 490 | 19 | 9 |
| class | 490 | 36 | 19 |
| order | 490 | 65 | 28 |
| family | 490 | 124 | 54 |
| genus | 490 | 316 | 115 |
| otu | 490 | 20079 | 705 |
| asv | 490 | 104106 | 478 |
| dada2 | 490 | 5508 | 630 |
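To make the effect of preprocessing easier to compare across levels, the fraction of features retained can be computed directly from the table above. This is a small Python sketch that uses only the table's numbers; the variable names are illustrative.

```python
# (n_features, n_features_preproc) for each level, copied from the table above.
feature_counts = {
    "phylum": (19, 9),
    "class": (36, 19),
    "order": (65, 28),
    "family": (124, 54),
    "genus": (316, 115),
    "otu": (20079, 705),
    "asv": (104106, 478),
    "dada2": (5508, 630),
}

# Fraction of features that survive preprocessing at each level.
retained = {
    level: kept / total for level, (total, kept) in feature_counts.items()
}

for level, fraction in retained.items():
    print(f"{level:>7}: {fraction:.1%} of features retained")
```

The Mothur ASV set keeps the smallest share of its features, while the DADA2 set starts much smaller and keeps a larger share.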
Random Forest on OTU-level data yields the highest median AUC, followed by Family, Genus, and ASV. While the OTU AUC is not significantly higher than that of the Family or Genus level, it is significantly higher than that of the ASV level.
| Level | Median AUC |
|---|---|
| phylum | 0.585 |
| class | 0.605 |
| order | 0.659 |
| family | 0.687 |
| genus | 0.686 |
| otu | 0.698 |
| asv | 0.676 |
| Level | Median AUC |
|---|---|
| phylum | 0.587 |
| class | 0.590 |
| order | 0.609 |
| family | 0.604 |
| genus | 0.604 |
| otu | 0.616 |
| asv | 0.622 |
| Level | Median AUC |
|---|---|
| phylum | 0.585 |
| class | 0.569 |
| order | 0.577 |
| family | 0.604 |
| genus | 0.595 |
| otu | 0.583 |
| asv | 0.566 |
| Level | Median AUC |
|---|---|
| phylum | 0.571 |
| class | 0.559 |
| order | 0.613 |
| family | 0.615 |
| genus | 0.616 |
| otu | 0.619 |
| asv | 0.628 |
| Level | Median AUC |
|---|---|
| phylum | 0.608 |
| class | 0.623 |
| order | 0.657 |
| family | 0.668 |
| genus | 0.651 |
| otu | 0.672 |
| asv | 0.648 |
The mikropml package includes an option for estimating feature importance with a permutation method, which substantially increases the time needed to run the models. The function permutes each model feature (e.g. a genus) and recalculates the AUC with that feature permuted. The outputs are the permuted AUC and the difference between the actual AUC and the permuted AUC. A larger positive difference indicates a more important feature, because the AUC drops further when that feature is permuted.
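The permutation idea described above can be sketched in a few lines of plain Python. This is an illustration of the general technique, not mikropml's implementation: the toy data, the stand-in `predict` function, and the number of permutations (`n_perm`) are all assumptions made for the example.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_perm=20, seed=0):
    """Return the actual AUC and, per feature, (mean permuted AUC, AUC difference)."""
    rng = random.Random(seed)
    actual_auc = auc(y, [predict(row) for row in X])
    result = {}
    for j in range(len(X[0])):
        permuted_aucs = []
        for _ in range(n_perm):
            # Shuffle column j only, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted_aucs.append(auc(y, [predict(row) for row in X_perm]))
        mean_permuted = sum(permuted_aucs) / n_perm
        # Positive difference: the AUC dropped when this feature was permuted.
        result[j] = (mean_permuted, actual_auc - mean_permuted)
    return actual_auc, result


# Toy data (assumption): feature 0 drives the label, feature 1 is pure noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] + 0.1 * data_rng.gauss(0, 1) > 0.5 else 0 for row in X]


def predict(row):
    """Stand-in for a trained model: scores samples using feature 0 only."""
    return row[0]


actual_auc, importance = permutation_importance(predict, X, y)
for j, (perm_auc, diff) in importance.items():
    print(f"feature {j}: permuted AUC = {perm_auc:.3f}, difference = {diff:.3f}")
```

In this toy setup, permuting feature 0 should lower the AUC noticeably, while permuting feature 1 should leave it unchanged because the stand-in model never uses it.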
The importance values are all from the Random Forest model.
| Rank | Phylum | Class | Order | Family | Genus | OTU | ASV |
|---|---|---|---|---|---|---|---|
| 1 | Fusobacteria | Fusobacteriia | Fusobacteriales | Clostridiales_Incertae_Sedis_XI | Porphyromonas | Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Porphyromonas(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiaceae_1(100);Clostridium_sensu_stricto(100); |
| 2 | Bacteroidetes | Betaproteobacteria | Synergistales | Bacteroidaceae | Bacteroides | Bacteria(100);Fusobacteria(100);Fusobacteriia(100);Fusobacteriales(100);Fusobacteriaceae(100);Fusobacterium(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100); |
| 3 | Synergistetes | Negativicutes | Bacillales | Lachnospiraceae | Gemella | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); | Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Prevotellaceae(100);Prevotella(100); |
| 4 | Actinobacteria | Bacteroidia | Coriobacteriales | Synergistaceae | Fusobacterium | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcus(59); | Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Porphyromonas(100); |
| 5 | Deinococcus-Thermus | Synergistia | Burkholderiales | Bacillales_Incertae_Sedis_XI | Ruminococcus | Bacteria(100);Firmicutes(100);Erysipelotrichia(100);Erysipelotrichales(100);Erysipelotrichaceae(100);Coprobacillus(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100); |
| 6 | Verrucomicrobia | Firmicutes_unclassified | Selenomonadales | Coriobacteriaceae | Peptoniphilus | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Blautia(100); |
| 7 | Firmicutes | Deltaproteobacteria | Clostridia_unclassified | Enterobacteriaceae | Anaerostipes | Bacteria(100);Firmicutes(100);Bacilli(100);Bacillales(100);Bacillales_Incertae_Sedis_XI(100);Gemella(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100); |
| 8 | Proteobacteria | Verrucomicrobiae | Actinomycetales | Clostridia_unclassified | Clostridium_XlVb | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100); |
| 9 | Bacteria_unclassified | Bacteroidetes_unclassified | Verrucomicrobiales | Desulfovibrionaceae | Akkermansia | Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); |
| 10 | NA | Bacilli | Desulfovibrionales | Clostridiales_unclassified | Pseudoflavonifractor | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcus(70); | Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100); |